Heterogeneous-Reliability Memory: Exploiting Application-Level Memory Error Tolerance

Authors

Yixin Luo

Sriram Govindan

Bikash Sharma

Mark Santaniello

Justin Meza

Aman Kansal

Jie Liu

Badriddine M. Khessib

Kushagra Vaid

Onur Mutlu

Abstract

Recent studies estimate that server cost accounts for as much as 57% of the total cost of ownership (TCO) of a datacenter [1]. One key contributor to this high server cost is the procurement of memory devices such as DRAMs, especially for data-intensive datacenter cloud applications that need low latency (such as web search, in-memory caching, and graph traversal). Such memory devices, however, may be prone to hardware errors that occur due to unintended bit flips during device operation [40, 33, 41, 20]. To protect against such errors, traditional systems uniformly employ devices with high-quality chips and error correction techniques, both of which increase device cost. At the same time, we observe that 1) data-intensive applications exhibit a diverse spectrum of tolerance to memory errors, and 2) traditional one-size-fits-all memory reliability techniques are inefficient in terms of cost. Our DSN-44 paper [30] is the first to 1) understand how tolerant different data-intensive applications are to memory errors and 2) design a new memory system organization that matches hardware reliability to application tolerance in order to reduce system cost. The main idea of our approach is to classify applications based on their memory error tolerance, and to map applications to heterogeneous-reliability memory system designs, managed cooperatively between hardware and software, to reduce system cost. Our DSN-44 paper provides the following contributions:

Similar resources

Exploiting Memory Device Wear-Out Dynamics to Improve NAND Flash Memory System Performance

This paper advocates a device-aware design strategy to improve various NAND flash memory system performance metrics. It is well known that NAND flash memory program/erase (PE) cycling gradually degrades memory device raw storage reliability, and sufficiently strong error correction codes (ECC) must be used to ensure the PE cycling endurance. Hence, memory manufacturers must fabricate enough num...

A Fault-Tolerant 176 Gbit Solid State Mass Memory Architecture

This paper presents a new Solid State Mass Memory (SSMM) suitable for space applications. The memory reliability is increased by using two different approaches. Firstly, mass memory fault tolerance, with respect to hard failures, is obtained by using a fine-granularity hierarchical structure with a certain level of redundancy. A second strategy, used for handling soft errors, is based on Error Corr...

System Effects of Single Event Upsets

At the system level, SEUs in processors are controlled by fault-tolerance techniques such as replication and voting, watchdog processors, and tagged data schemes [13,16,30]. SEUs in memory subsystems are controlled by use of error control codes (ECCs) [4,17,21] and a process called scrubbing. The scrubbing process periodically reads each word in the memory. If the number of faulty digits in a w...

Replication for Efficiency and Fault Tolerance in a DSM System

Distributed Shared Memory (DSM) systems implemented on a network of workstations (NOW) have become a convenient alternative to shared memory architectures for executing long-running parallel applications. However, such architectures are susceptible to failures. This paper presents the design and implementation of a recoverable DSM (RDSM) based on a backward error recovery (BER) mechani...

Partially-Forgetful Memories: Relaxing Memory Guard-bands for Approximate Computing

While the memory subsystem is already a major contributor to energy consumption of computing platforms, the guardbanding required for masking the effects of ever increasing manufacturing variations in memories imposes even more energy overhead. In this paper, we explore how Partially-Forgetful Memories can be used by exploiting the intrinsic tolerance of a vast class of applications to some leve...

Journal:

CoRR

Volume: abs/1602.00729

Pages: -

Publication date: 2015
